The Penn Treebank: an Overview

Authors

  • Ann Taylor
  • Mitchell Marcus
  • Beatrice Santorini

Abstract

The Penn Treebank, in its eight years of operation (1989-1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).

Similar resources

Proposition Bank II: Delving Deeper

The PropBank project is creating a corpus of text annotated with information about basic semantic propositions. PropBank I (Kingsbury & Palmer, 2002) added a layer of predicate-argument information, or semantic roles, to the syntactic structures of the English Penn Treebank. This paper presents an overview of the second phase of PropBank Annotation, PropBank II, which is being applied to English...

CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank

This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word–word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that ob...

Identifying Verb Arguments and their Syntactic Function in the Penn Treebank

In this paper, we present a tool that automatically extracts verb argument structure from the Penn Treebank as well as from other corpora annotated with the Penn Treebank release 2 conventions. More specifically, we examine each possible sequence of tags, both functional and categorial, and determine whether such a sequence indicates an obligatory argument, an optional argument or a...

Annotating Discourse Relations with the PDTB Annotator

The PDTB Annotator is a tool for annotating and adjudicating discourse relations based on the annotation framework of the Penn Discourse TreeBank (PDTB). This demo describes the benefits of using the PDTB Annotator, gives an overview of the PDTB framework, and discusses the tool’s features, its setup requirements, and how it can also be used for adjudication.

Searching in the Penn Discourse Treebank Using the PML-Tree Query

The PML-Tree Query is a general, powerful and user-friendly system for querying richly linguistically annotated treebanks. The present paper shows how the PML-Tree Query can be used for searching for discourse relations in the Penn Discourse Treebank 2.0 mapped onto the syntactic annotation of the Penn Treebank.

Publication date: 2003